Descriptor learning for omnidirectional image matching 



Jonathan Masci, Davide Migliore

jonathan@idsia.ch, davide.migliore@gmail.com

Michael M. Bronstein

michael.bronstein@usi.ch

Jürgen Schmidhuber

juergen@idsia.ch

Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Manno, Switzerland
Faculty of Informatics, Università della Svizzera Italiana (USI), Lugano, Switzerland
Scuola Universitaria Professionale della Svizzera Italiana (SUPSI), Lugano, Switzerland

Evidence Srl, Pisa, Italy

Abstract 

Feature matching in omnidirectional vision systems is a challenging problem, mainly because compli- 
cated optical systems make the theoretical modelling of invariance and construction of invariant feature 
descriptors hard or even impossible. In this paper, we propose learning invariant descriptors using a 
training set of similar and dissimilar descriptor pairs. We use the similarity-preserving hashing
framework, which maps the descriptor data to the Hamming space while preserving the descriptor
similarity on the training set. A neural network is used to solve the underlying optimization
problem. Our approach outperforms not only straightforward descriptor matching, but also
state-of-the-art similarity-preserving hashing methods.

1. Introduction 

Feature-based matching between images has become a standard approach in the computer vision literature
over the last decade, in many respects due to the introduction of stable and invariant feature detection
and description algorithms such as SIFT [23] and similar methods [?]. The usual assumption guiding
the design of feature descriptors is invariance across viewpoints, which should guarantee that the same
feature appearing in two different views has the same descriptor. Since perspective transformations are
approximately locally affine, it is common to construct affine-invariant descriptors [21].

While being a good model in many cases, affine invariance is not sufficiently accurate in cases of wide
baseline (very different viewpoints) or in even more complicated settings involving optical imperfections
such as lens distortions, blur, etc. In particular, in omnidirectional vision systems the distortion is
introduced intentionally (e.g., using a parabolic mirror [25]) to allow a 360° view. Designing invariant
descriptors for such cases is challenging, as the invariance is complicated and cannot be easily modeled.

An alternative to 'invariance-by-construction' approaches which rely on a simplified invariance model 
is to learn the descriptor invariance from examples. Recent work of Strecha et al. [35] showed very
convincingly that such approaches can significantly improve the performance of existing descriptors. 

In this paper, we consider the learning of invariant descriptors for omnidirectional image matching. 
We construct a training set of similar and dissimilar descriptor pairs including strong optical distortions, 
and use a neural network to learn a mapping from the descriptor space to the Hamming space preserv- 
ing similarity on the training set. Experimental results show that our approach outperforms not only 
straightforward descriptors, but also other similarity-preserving hashing methods. The latter observation 
is explained by the suboptimality of existing approaches which solve a simplified optimization problem. 

The main contribution of this paper is two-fold. First, we formulate a new similarity-sensitive hashing
algorithm. Second, we use this approach to learn smaller invariant descriptors suitable for feature
matching in omnidirectional images. The rest of the paper is organized as follows. In Section 2, we
review related work. Section 3 is dedicated to metric learning and similarity-preserving hashing
methods. In Section 4, we describe our NNhash approach. Section 5 contains experimental results.
Finally, Section 6 discusses potential future work and concludes the paper.

2. Background 

Although feature-based correspondence problems have been investigated in depth for standard per- 
spective cameras, omnidirectional image matching still remains an open problem, largely because of 
the complicated geometry introduced by lenses and curved mirrors. Broadly speaking, the existing ap- 
proaches either try to reduce the problem to the simpler perspective setting, or design special descriptors 
suitable for omnidirectional images. 

Svoboda et al. [36] proposed to use adaptive windows around interest points to generate normalized
patches, under the assumption that the displacement of the omnidirectional system is smaller than the
depth of the surrounding scene. Nayar [28] showed that, given the mirror parameters, it is possible to
generate a perspective version of the omnidirectional image, and Mauthner et al. [24] used this approach
to generate a perspective representation of each interest point region. This unwarping procedure removes
the non-linear distortions and enables the use of algorithms designed for perspective cameras. Micusik
and Pajdla [26] checked the candidate correspondences between two views using the RANSAC algorithm
and the epipolar constraint [13]. Construction of a scale-space by means of diffusion on manifolds was
used in [3, 16, ?] for the construction of local descriptors. Puig et al. [29] integrated the sphere camera
model with the framework of partial differential equations on manifolds.

Another possible solution is to consider different kinds of features that exploit particular invariances
of omnidirectional systems, for example, extracting one-dimensional features [5] or vertical lines [32]
and defining descriptors suitable for omnidirectional images.

More recently, it was shown in [35] that one can approach the design of invariant descriptors from the
perspective of metric learning, constructing a distance between the descriptor vectors from a training
set of similar and dissimilar pairs [?, 42]. In particular, similarity-preserving hashing methods
[?, 34, 43, 22, 30] were found especially attractive for descriptor learning, as they significantly reduce
descriptor storage and comparison complexity. These methods have also been applied to image search
[?], video copy detection [?], and shape retrieval [?].

In [31], binary codes were produced using a restricted Boltzmann machine, and in [43] using spectral
hashing in an unsupervised setting. The authors showed that the learnt binary vectors capture the
similarities of the data. With such an approach, however, it is impossible to explicitly provide
information about data similarities. Since in our problem it is easy to produce labeled data, supervised
metric learning is advantageous.

3. Similarity preserving hashing 

Given a set of keypoint descriptors, represented as $n$-dimensional vectors in $\mathbb{R}^n$, the problem of
metric learning is to find their representation in some metric space $(Z, d_Z)$ by means of a map of the form
$y : \mathbb{R}^n \to (Z, d_Z)$. The metric $d_Z \circ (y \times y)$ parametrizes the similarity between the feature
descriptors, which may be difficult to compute in the original representation. Typically, $(Z, d_Z)$ is fixed
and $y$ is the map we are trying to find in such a way that, given a set $\mathcal{P}$ of pairs of descriptors from
corresponding points in different images (positives) and a set $\mathcal{N}$ of pairs of descriptors from different
points (negatives), we have $d_Z(y(x), y(x^+)) \approx 0$ for all $(x, x^+) \in \mathcal{P}$ and
$d_Z(y(x), y(x^-)) \gg 0$ for all $(x, x^-) \in \mathcal{N}$ with high probability.

The particular setting of this problem in which $Z = \{\pm 1\}^m$ is the $m$-dimensional space of binary strings
and $d_{\mathbb{H}^m}(y, y') = \frac{m}{2} - \frac{1}{2} \sum_{i=1}^{m} \mathrm{sign}(y_i y'_i)$ is the Hamming metric
is referred to as similarity-preserving hashing. Here, we limit our attention to affine embeddings of the form

\[ y = \mathrm{sign}(Px + t), \tag{1} \]

where $P$ is an $m \times n$ matrix and $t$ is an $m \times 1$ vector. Our goal is to find $P$ and $t$ that minimize
one of the following cost functions,

\[ L_c(P, t) = E\{ y(x)^\top y(x^-) - \alpha\, y(x)^\top y(x^+) \}, \quad \text{or} \]
\[ L_d(P, t) = E\{ \alpha \| y(x) - y(x^+) \|^2 - \| y(x) - y(x^-) \|^2 \} \]

for $(x, x^+) \in \mathcal{P}$ and $(x, x^-) \in \mathcal{N}$. Both cost functions try to map positives as close as
possible to each other (expressed as large correlation or small distance), and negatives as far as possible
from each other (small correlation or large distance), in order to ensure low false positive (FPR) and false
negative (FNR) rates; $\alpha > 0$ is a parameter determining the tradeoff between the FPR and FNR. In practice,
the expectations are approximated as means on some sufficiently large training set.
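As an illustration of these definitions, the following NumPy sketch evaluates the hash of equation (1), the Hamming metric, and the empirical versions of both costs. The function names, the array layout (one descriptor per row, with `X`, `Xpos` and `Xneg` of equal shape so that row `k` of `X` is paired with row `k` of the other two) and the tie-breaking at zero are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def hash_codes(X, P, t):
    """Affine hash y = sign(Px + t) for each row x of X; outputs lie in {-1, +1}."""
    Y = np.sign(X @ P.T + t)
    Y[Y == 0] = 1.0  # assumption: ties at exactly zero are mapped to +1
    return Y

def hamming(Y1, Y2):
    """Row-wise Hamming distance of +/-1 codes: m/2 - (1/2) * sum_i y_i * y'_i."""
    m = Y1.shape[1]
    return 0.5 * m - 0.5 * np.sum(Y1 * Y2, axis=1)

def empirical_costs(X, Xpos, Xneg, P, t, alpha=1.0):
    """Sample-mean approximations of L_c and L_d (expectations replaced by means)."""
    Y, Yp, Yn = (hash_codes(A, P, t) for A in (X, Xpos, Xneg))
    L_c = np.mean(np.sum(Y * Yn, axis=1)) - alpha * np.mean(np.sum(Y * Yp, axis=1))
    L_d = (alpha * np.mean(np.sum((Y - Yp) ** 2, axis=1))
           - np.mean(np.sum((Y - Yn) ** 2, axis=1)))
    return L_c, L_d
```

For codes in $\{\pm 1\}^m$ the two costs differ only by a scaling and a constant, $L_c = (1 - \alpha) m + L_d / 2$, which gives a quick sanity check for an implementation.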

The problem $\min_{P,t} L(P, t)$ is a non-linear non-convex optimization problem without an obvious
simple solution. It is commonly approached by the following two-stage relaxation. First, approximate
the map as $y \approx Px$ by removing the sign and the offset vector, minimizing

\[ L_c(P) = E\{ (Px)^\top (Px^-) - \alpha (Px)^\top (Px^+) \}, \quad \text{or} \]
\[ L_d(P) = E\{ \alpha \| P(x - x^+) \|^2 - \| P(x - x^-) \|^2 \} \]

w.r.t. $P$ (introducing some regularization, e.g., $P^\top P = I$, in order to avoid the trivial solution $P = 0$).
Second, fix $P^* = \arg\min_P L(P)$ and solve $t^* = \arg\min_t L(P^*, t)$ w.r.t. $t$. To further simplify the
problem, it is also common to assume separability, thus solving independently for each dimension of
the hash.



3.1. Similarity-sensitive hashing (SSH) 

In [34], the above strategy was used for the approximate minimization of the cost $L_c$. The computation
of optimal parameters $P$ and $t$ was posed as a boosted binary classification problem, where $d_{\mathbb{H}^m}(y, y')$
acts as a strong binary classifier, and each dimension of the linear projection $\mathrm{sign}(p_i^\top x + t_i)$ is
considered a weak classifier (here, $p_i^\top$ denotes the $i$-th row of $P$). This way, AdaBoost can be used to find
a greedy approximation of the minimizer of $L_c$ by progressively constructing $P$ and $t$. At the $i$-th iteration,
the $i$-th row of the matrix $P$ and the $i$-th element of the vector $t$ are found by minimizing a weighted version
of $L_c$. Since the problem is non-linear, such an optimization is challenging. In [34], random
projection directions were used. A better method for projection selection, similar to linear discriminant
analysis (LDA), was proposed in [7, 9]. Weights of false positive and false negative pairs are increased, and
weights of true positive and true negative pairs are decreased, using the standard AdaBoost reweighting
scheme [12].
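The reweighting step can be illustrated with the textbook discrete AdaBoost update. The sketch below is not the exact procedure of [34]; the function name `adaboost_reweight` and the encoding of pairs are assumptions made for illustration. A label is +1 for a positive pair and -1 for a negative pair, and a prediction is +1 when the weak classifier assigns the same bit to both descriptors of the pair and -1 otherwise.

```python
import numpy as np

def adaboost_reweight(weights, labels, predictions):
    """One discrete AdaBoost update of the pair weights.

    weights     : non-negative pair weights summing to one
    labels      : +1 for positive pairs, -1 for negative pairs
    predictions : +1 if the current bit agrees on the pair, -1 otherwise
    """
    # Weighted error of the weak classifier (false positives and false negatives).
    err = np.sum(weights[predictions != labels])
    err = np.clip(err, 1e-12, 1.0 - 1e-12)  # guard against log(0) and division by zero
    alpha = 0.5 * np.log((1.0 - err) / err)
    # Misclassified pairs are up-weighted, correctly classified pairs down-weighted.
    new_weights = weights * np.exp(-alpha * labels * predictions)
    return new_weights / new_weights.sum(), alpha
```

After the update, the misclassified pairs carry half of the total weight, so the next weak classifier is pushed towards the pairs that the current bits handle poorly.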

3.2. Covariance difference hashing (diff-hash) 

In [35], it was observed that the minimization $\min_P L_d(P)$ can be written as

\[ \min_P \mathrm{tr}\{ P (\alpha C_+ - C_-) P^\top \} \quad \text{s.t.} \quad P^\top P = I, \tag{2} \]

where $C_\pm = E\{ (x - x^\pm)(x - x^\pm)^\top \}$ are the covariance matrices of the differences of the positive and
negative pairs of vectors. Requiring an orthonormal projection matrix $P$, the problem has a closed-form
solution consisting of the eigenvectors of $(\alpha C_+ - C_-)$ corresponding to the $m$ smallest eigenvalues,
and is thus also a separable problem. Once the projection is found in this way, the threshold vector $t$
minimizing the sum of the false positive and false negative rates is selected. This second stage also turns
out to be separable in each dimension. In [?], a more generic kernelized version of diff-hash (kdiff-hash)
was presented.
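A minimal NumPy sketch of this two-stage procedure is given below. It is an illustration under stated assumptions, not the reference implementation of [35]: the function names, the representation of the pair sets as row-aligned arrays (`pos_a[k]` paired with `pos_b[k]`, `neg_a[k]` with `neg_b[k]`), and the search over a fixed grid of candidate thresholds are all choices made for this example.

```python
import numpy as np

def diff_hash_projection(pos_a, pos_b, neg_a, neg_b, m, alpha=1.0):
    """Rows of P are the eigenvectors of (alpha*C+ - C-) with the m smallest eigenvalues."""
    d_pos, d_neg = pos_a - pos_b, neg_a - neg_b
    C_pos = d_pos.T @ d_pos / len(d_pos)   # sample estimate of E{(x - x+)(x - x+)^T}
    C_neg = d_neg.T @ d_neg / len(d_neg)   # sample estimate of E{(x - x-)(x - x-)^T}
    _, eigvecs = np.linalg.eigh(alpha * C_pos - C_neg)  # eigenvalues in ascending order
    return eigvecs[:, :m].T                # shape (m, n), orthonormal rows

def select_thresholds(P, pos_a, pos_b, neg_a, neg_b, n_candidates=64):
    """Per-dimension offset t_i minimizing the sum of FN and FP rates of bit i."""
    t = np.empty(P.shape[0])
    for i, p in enumerate(P):
        zp_a, zp_b, zn_a, zn_b = pos_a @ p, pos_b @ p, neg_a @ p, neg_b @ p
        all_z = np.concatenate([zp_a, zp_b, zn_a, zn_b])
        candidates = np.quantile(all_z, np.linspace(0.0, 1.0, n_candidates))
        best_tau, best_err = candidates[0], np.inf
        for tau in candidates:
            # False negative: a positive pair split by the threshold (bits differ).
            fn = np.mean((zp_a >= tau) != (zp_b >= tau))
            # False positive: a negative pair left on the same side (bits agree).
            fp = np.mean((zn_a >= tau) == (zn_b >= tau))
            if fn + fp < best_err:
                best_tau, best_err = tau, fn + fp
        t[i] = -best_tau                   # sign(p^T x + t_i) switches at p^T x = tau
    return t
```

Because each row of `P` and each entry of `t` is obtained independently of the others, the sketch also makes the separability of the relaxed problem explicit.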

3.3. LDAHash 

A similar method was derived in [35] by transforming the coordinates as $C_-^{-1/2} x$, which allows
$\min_P L_d(P)$ to be written as

\[ \min_P \mathrm{tr}\{ P (C_+ C_-^{-1}) P^\top \} \quad \text{s.t.} \quad P^\top P = I. \tag{3} \]

This approach resembles linear discriminant analysis (LDA), hence the name LDAHash. Requiring an
orthonormal projection matrix $P$, the problem has a separable closed-form solution consisting of the
eigenvectors of $(C_+ C_-^{-1})$ corresponding to the $m$ smallest eigenvalues.

4. Neural network hashing (NNhash) 

The problem with the existing and most successful similarity-preserving hashing approaches, such as LDA-
or diff-hash, is that they do not solve the optimization problem $\min_{P,t} L(P, t)$ but rather its relaxation.
As a result, the parameters $P^*, t^*$ found by these methods in the aforementioned two-stage separable scheme
are suboptimal, i.e., $L(P^*, t^*) > \min L$. Our experience shows that in some cases the suboptimality is
dramatic (at least an order of magnitude).

A way of solving the 'true' optimization problem is by formulating it in the neural network (NN) 
framework and exploiting numerous optimization techniques and heuristics developed in this field. Since 



we have a way of cheaply producing labeled data, we adopt the Siamese network architecture
[33, 15] which, contrary to conventional models, receives two input patterns and minimizes a loss function
similar to equation (2),

\[ L_{NN}(P, t) = \frac{1}{2} \| y(x) - y(x^+) \|^2 + \frac{1}{2} \left( \max\{ 0, m - \| y(x) - y(x^-) \| \} \right)^2, \tag{4} \]

where the constant $m$ represents the margin between dissimilar pairs. The margin is introduced as
regularization, to prevent the system from minimizing the loss simply by pushing negative pairs as far
apart as possible. The embedding is then learned so as to make positive pairs as close as possible and
negative pairs at least at distance $m$.

Network architectures of this type can be traced back to the work of Schmidhuber and Prelinger [33]
on problems of predictable classification. In [15], Siamese networks were used to learn an invariant
mapping of tiny images directly from the pixel representation, whereas in [?] a similar approach is used
to learn a model that is highly effective at matching people in similar poses and exhibits invariance to
identity, clothing, background, lighting, shift and scale. An advantage of such an architecture is that one
can create arbitrarily complex embeddings by simply stacking many layers in the network. In all our
experiments, in order to make a fair comparison to other hashing methods, we adopt a simple single-layer
architecture, wherein $y(x) = \mathrm{sign}(Px + t)$. Network training attempts to find $P, t$ that minimize
$L_{NN}$ (which is a regularized version of $L_d$). Since we solve a non-linear problem without introducing any
simplification or relaxation, the results are expected to be better than those of the hashing methods
described in Section 3. In the following, we refer to our method as NNhash.

Since a binary output is required, we adopt $\tanh(\beta t) \approx \mathrm{sign}(t)$ as the non-linear activation
function for our Siamese network, which enforces binary vectors when either $m$ or the steepness $\beta$ of the
function is increased. Since the problem is highly non-convex, the optimization is liable to converge to a
local minimum, and thus there is no theoretical guarantee of finding the global minimum. However, by
initializing $P, t$ with the solution obtained by one of the standard hashing methods, we have a good
initial point that can be improved by network optimization.
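The single-layer formulation can be sketched in NumPy as follows. This is an illustrative sketch only: it averages the loss of equation (4) over row-aligned arrays of positive and negative pairs, and it uses plain gradient descent, whereas the experiments in this paper use scaled conjugate gradient. The function names, the pair layout, the learning rate, and the small constant added to the distance for numerical safety are all assumptions of this example.

```python
import numpy as np

def embed(X, P, t, beta=1.0):
    """Relaxed hash tanh(beta * (Px + t)) for each row x of X."""
    return np.tanh(beta * (X @ P.T + t))

def nnhash_loss_and_grad(P, t, pos_a, pos_b, neg_a, neg_b, margin=5.0, beta=1.0):
    """Mean contrastive loss over the pair sets and its gradients w.r.t. P and t."""
    Ya, Yb = embed(pos_a, P, t, beta), embed(pos_b, P, t, beta)
    Yc, Yd = embed(neg_a, P, t, beta), embed(neg_b, P, t, beta)
    diff_pos, diff_neg = Ya - Yb, Yc - Yd
    dist_neg = np.linalg.norm(diff_neg, axis=1) + 1e-12   # avoid division by zero
    slack = np.maximum(0.0, margin - dist_neg)            # active only inside the margin
    loss = (0.5 * np.sum(diff_pos ** 2) / len(pos_a)
            + 0.5 * np.sum(slack ** 2) / len(neg_a))

    # Gradients of the loss with respect to the network outputs.
    g_a = diff_pos / len(pos_a)
    g_c = -(slack / dist_neg)[:, None] * diff_neg / len(neg_a)

    # Backpropagate through tanh and the affine map, summing over both branches.
    gP, gt = np.zeros_like(P), np.zeros_like(t)
    for X, Y, G in ((pos_a, Ya, g_a), (pos_b, Yb, -g_a),
                    (neg_a, Yc, g_c), (neg_b, Yd, -g_c)):
        dZ = G * beta * (1.0 - Y ** 2)
        gP += dZ.T @ X
        gt += dZ.sum(axis=0)
    return loss, gP, gt

def gradient_descent(P, t, pairs, steps=100, lr=0.05, **kwargs):
    """Plain gradient descent (a stand-in for the scaled conjugate gradient used here)."""
    for _ in range(steps):
        _, gP, gt = nnhash_loss_and_grad(P, t, *pairs, **kwargs)
        P, t = P - lr * gP, t - lr * gt
    return P, t
```

In such a setup, `P` and `t` would be initialized from one of the standard hashing solutions, as described above, and the final binary descriptor would be obtained by taking the sign of the trained embedding.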

5. Results 

5.1. Data 

In our experiments, we used the Rawseeds dataset [?, 10]. The dataset contains video sequences of a
robot equipped with an omnidirectional camera system based on a parabolic mirror, moving through indoor
and outdoor scenes. The image undergoes significant distortion since different parts of the scene move
from the central part of the mirror to the boundaries.

We used the toolbox of Vedaldi [40] to compute SIFT features in each frame of the video. Since the
robot movement is slow, the change between two adjacent frames in the dataset is very small, and SIFT
features can be matched reliably. Tracking features over multiple frames, we constructed the positive set
as the transitive closure of these adjacent feature descriptor pairs. This way, the positive set also
included descriptors distant in time and, as a result of robot motion, located in different regions of the
image and thus subject to strong distortions. As negatives, we used features not belonging to the same track.
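The transitive closure of adjacent-frame matches can be computed with a union-find structure, as in the sketch below. The integer feature identifiers, the function names, and the input format (a list of matched identifier pairs between adjacent frames) are hypothetical and serve only to illustrate the construction of the positive set.

```python
from itertools import combinations

def build_tracks(num_features, adjacent_matches):
    """Group features into tracks: the connected components of the match graph."""
    parent = list(range(num_features))

    def find(i):
        # Find the representative of i, halving paths along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in adjacent_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    tracks = {}
    for i in range(num_features):
        tracks.setdefault(find(i), []).append(i)
    # Features that were never matched do not form a track.
    return [track for track in tracks.values() if len(track) > 1]

def positive_pairs(tracks):
    """All pairs of features within the same track, including frames distant in time."""
    return [pair for track in tracks for pair in combinations(track, 2)]
```

For example, the matches `(0, 1)` and `(1, 2)` yield the track `[0, 1, 2]`, so the pair `(0, 2)` becomes a positive even though the two features were never matched directly.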

In addition to the Rawseeds dataset, we created synthetic omnidirectional datasets using panorama 
images that were warped simulating the effect of a parabolic mirror. The warping intentionally was not 
the same as in Rawseeds dataset. By moving the panorama image, we created synthetic motion with 




Figure 1: A few frames from the Rawseeds dataset exemplifying how a descriptor changes over time due
to camera motion throughout the scene. First row: omnidirectional images of the indoor dataset, shown 
at times 1 (left), 5 (middle) and 50 (right). Second row: SIFT descriptors at point indicated in red. Third 
row: binary descriptors of length 32 produced by NNhash trained on outdoor images. 

known pixel-wise ground-truth correspondence (Figure 5). The positive and negative sets for synthetic
data were constructed as described above. 

5.2. Methods 

We compared the SSH [34], diff-hash [35], and our NNhash methods. For the NNhash training we
used scaled conjugate gradient over the whole batch of descriptors, which we normalized to the range
$[-1, 1]$. We used a margin $m = 5$ in all cases. The steepness factor for tanh is $\beta = 1$ in the case of
32 bits, while for 64 bits we gradually increased it up to 3 so as to obtain a smooth binarization. We
reached convergence in about 50 epochs in all cases.

5.3. Performance degradation in time 

For this experiment, we constructed the training set using descriptors extracted from about 300 con- 
secutive frames of the outdoor sequence (similar results were obtained when using outdoor or synthetic 
data for training). We considered descriptors that could be tracked for at least 60 consecutive frames and 
selected as positives pairs of descriptors belonging to these tracks. 

To avoid bias, we selected pairs of descriptors in frames $t_i, t_j$ in such a way that the time difference




Figure 2: ROC curves for the outdoor dataset, with frames taken at various distances $\Delta t$. Each hashing
method is shown with 32 and 64 bits. Note the significant performance degradation of SIFT and the only
minor performance degradation of NNhash.

$\Delta t = |t_i - t_j|$ between the frames was uniformly distributed. The training was performed on a positive
set of size 10^ and on a negative set of size 10^ to produce hashes of length 32 and 64 bits. 

Testing was performed on a different portion of the same sequence, where frames at distance
$10 < \Delta t < 30$ (Figure 2, left) and $20 < \Delta t < 40$ (Figure 2, right) were used. A few phenomena can be
observed in Figure 2, which shows the ROC curves of straightforward SIFT matching using the Euclidean
distance and of matching learned binary descriptors using the Hamming distance. First, we can see that
even with very compact descriptors (as small as 64 bits, compared to the 1024 bits required to represent
SIFT) we match or outperform SIFT. These results are consistent with the study in [35]. Second, we observe
that NNhash significantly outperforms other hashing methods for the same number of bits. This is a
clear indication that the SSH and diff-hash methods find a suboptimal solution by solving a relaxed
problem, while NNhash attempts to solve the full non-linear non-convex optimization problem.
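Summary statistics of such ROC curves can be computed from the distances of positive and negative test pairs, as sketched below. The function names are assumptions of this example, as is the reading of an entry such as "FPR@1%" as the false positive rate at the operating point where the false negative rate does not exceed 1%; the paper does not spell out this definition.

```python
import numpy as np

def roc_points(pos_dist, neg_dist):
    """FPR and FNR when a pair is declared a match if its distance is <= threshold."""
    pos_dist, neg_dist = np.asarray(pos_dist), np.asarray(neg_dist)
    thresholds = np.unique(np.concatenate([pos_dist, neg_dist]))
    fnr = np.array([np.mean(pos_dist > th) for th in thresholds])   # missed positives
    fpr = np.array([np.mean(neg_dist <= th) for th in thresholds])  # accepted negatives
    return thresholds, fpr, fnr

def equal_error_rate(fpr, fnr):
    """Error rate at the sampled threshold where FPR and FNR are closest."""
    i = np.argmin(np.abs(fpr - fnr))
    return 0.5 * (fpr[i] + fnr[i])

def fpr_at_fnr(fpr, fnr, target_fnr):
    """Smallest FPR among the thresholds whose FNR does not exceed target_fnr."""
    return fpr[fnr <= target_fnr].min()
```

The same functions apply to Euclidean distances between SIFT descriptors and to Hamming distances between binary codes, since only the ordering of the distances matters.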

Comparing Figure 2 (left and right) and Tables 1-2, we can observe how the matching performance
degrades if we increase the time between the frames (from 10-30 frames to 20-40 frames). Because
of significant distortions caused by the parabolic mirror, objects moving around the scene appear
differently. This phenomenon is especially noticeable when the distance between the frames ($\Delta t$) is
large. SIFT shows significant degradation, while NNhash, trained on a dataset including positive pairs
at distances up to $\Delta t = 60$, degrades only slightly (even a 32-bit NNhash performs better than SIFT).
This is a clear indication that we are able to learn feature invariance.

Finally, Figure 4 shows a visual example of feature matching using different methods. NNhash
produces the matches most similar to the ground truth (shown in green).





Method     m (bits)   EER      FPR@1%    FPR@0.1%
SIFT       1024       1.91%    3.08%     13.87%
NNhash     32         1.66%    3.77%     23.81%
           64         1.31%    1.92%     9.48%
DiffHash   32         4.41%    9.36%     29.95%
           64         2.57%    5.17%     18.30%
SSH        32         4.02%    15.64%    36.41%
           64         2.22%    4.90%     16.74%

Table 1: Descriptor matching performance using different methods and descriptor sizes for frames with
time range $10 < \Delta t < 30$.





Method     m (bits)   EER      FPR@1%    FPR@0.1%
SIFT       1024       3.31%    7.47%     27.94%
NNhash     32         2.70%    6.98%     24.98%
           64         2.38%    4.54%     14.22%
DiffHash   32         5.17%    12.55%    37.49%
           64         3.69%    8.75%     27.34%
SSH        32         5.52%    24.10%    47.29%
           64         3.46%    9.48%     27.66%

Table 2: Descriptor matching performance using different methods and descriptor sizes for frames with
time range $20 < \Delta t < 40$.

5.4. Generalization 

To test for generalization, we performed transfer learning experiments from outdoor data to indoor
data and from synthetic data to real data.

Figure 3 (left) shows the performance of descriptors trained on outdoor and tested on indoor data. We
can see that even though the data used for training is very different from the data used for testing
(see Figures 1 and 4 for a visual comparison), we achieve better performance than SIFT with just
64 bits. Figure 3 (right) shows the performance of descriptors trained on synthetic and tested on indoor
data. All learning methods perform better than SIFT. The discrepancy between NNhash and the other
algorithms is less pronounced than in the real case.

6. Discussion, Conclusions, and Future Work 

We presented a new approach for feature matching in omnidirectional images based on
similarity-sensitive hashing and inspired by the recent work [35]. We learn a mapping from the descriptor space
to the space of binary vectors that preserves the similarity of descriptors on a training set. By carefully 
constructing the training set, we account for descriptor variability, e.g. due to optical distortions. The 
resulting descriptors are compact and are compared using the Hamming metric, offering significant 
computational advantage over other traditional metrics such as L2. Though tested with SIFT descriptors, 
our approach is generic and can be applied to any feature descriptor. 
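The computational advantage of the Hamming metric comes from the fact that a binary descriptor can be packed into a machine word and compared with an XOR followed by a bit count. The pure-Python sketch below illustrates the idea; the function names and the packing convention (bit set to 1 where the code is +1, most significant bit first) are assumptions of this example.

```python
def pack_bits(code):
    """Pack a sequence of +/-1 values into an integer (bit is 1 where the value is +1)."""
    word = 0
    for value in code:
        word = (word << 1) | (1 if value > 0 else 0)
    return word

def hamming_packed(a, b):
    """Hamming distance between two packed codes: number of set bits in a XOR b."""
    return bin(a ^ b).count("1")
```

For instance, the codes `[1, -1, 1, 1]` and `[-1, -1, 1, -1]` pack to `0b1011` and `0b0010`, and their distance is 2, the number of positions in which they differ.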

We compared several existing similarity-preserving hashing methods, as well as our NNhash method 




Figure 3: Left: ROC curves for the models trained on outdoor data and tested on indoor data with
descriptors taken at $35 < \Delta t < 60$. Right: ROC curves for the models trained on synthetic data and
tested on real indoor descriptors.

based on a neural network. Experimental results show that NNhash outperforms other approaches.
An explanation for this behavior is that today's state-of-the-art similarity-preserving hashing
algorithms, such as SSH or LDAHash, solve a simplified optimization problem whose solution does not
necessarily coincide with the solution of the "true" non-linear non-convex problem. We showed that,
using a neural network, we can address the "true" problem directly and obtain better performance.

Finally, our discussion in this paper was limited to simple embeddings of the form sign(Px+t) which 
in some cases are too simple. The neural network framework seems to us a very natural way to consider 
more generic embeddings using multi-layer network architectures. 